[SPARK-18362][SQL] Use TextFileFormat in implementation of CSVFileFormat #15813
Conversation
Test build #68359 has finished for PR 15813 at commit
    paths = inputPaths,
    className = classOf[TextFileFormat].getName
  ).resolveRelation(checkFilesExist = false))
  .select("value").as[String](Encoders.STRING)
Hi @JoshRosen, I just happened to look at this one and I am just curious. IIUC, the schema from `sparkSession.baseRelationToDataFrame` will always have only a `value` column, not including partitioned columns (it is empty, and `inputPaths` will always be leaf files). So, my question is: is that `.select("value")` used just to doubly make sure? Just curious.
I copied this logic from the `text` method in `DataFrameReader`, so that's where the `value` came from.
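For reference, a minimal sketch of the pattern under discussion, assembled from the diff above. It relies on Spark's internal `DataSource` API as it existed around this PR (Spark 2.x), so treat the exact signatures as assumptions rather than a stable interface; `readLines` is a hypothetical wrapper name, not a method in the patch:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.execution.datasources.text.TextFileFormat

// Build a text-based relation over the already-resolved input paths and
// project the single `value` column into a Dataset[String] for inference.
def readLines(sparkSession: SparkSession, inputPaths: Seq[String]): Dataset[String] =
  sparkSession.baseRelationToDataFrame(
    DataSource.apply(
      sparkSession,
      paths = inputPaths,
      className = classOf[TextFileFormat].getName
    ).resolveRelation(checkFilesExist = false))
    .select("value").as[String](Encoders.STRING)
```

Since the text relation produces exactly one `value` column here and no partition columns are involved, the `.select("value")` appears to be mostly defensive, matching the observation above.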
Jenkins, retest this please
Test build #68432 has finished for PR 15813 at commit
Jenkins, retest this please
Test build #68674 has finished for PR 15813 at commit
Test build #68800 has finished for PR 15813 at commit
  if (options.isCommentSet) {
    val comment = options.comment.toString
-   rdd.filter { line =>
+   lines.filter { line =>
Using an untyped `filter` can be more performant here since we don't need to pay for the extra de/serialization costs:
lines.filter(length(trim($"value")) > 0 && !$"value".startsWith(comment))
      line.trim.nonEmpty && !line.startsWith(comment)
    }.first()
  } else {
-   rdd.filter { line =>
+   lines.filter { line =>
Same as above.
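To make the suggestion concrete, a sketch of both forms, assuming `lines: Dataset[String]` with its default `value` column (the method names are illustrative only):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{length, trim}

// Typed filter: the lambda forces each row to be deserialized to a String
// before the predicate runs.
def firstLineTyped(lines: Dataset[String], comment: String): String =
  lines.filter { line =>
    line.trim.nonEmpty && !line.startsWith(comment)
  }.first()

// Untyped filter: the predicate is a Column expression evaluated on the
// internal row format, skipping the per-row de/serialization step.
def firstLineUntyped(lines: Dataset[String], comment: String): String = {
  import lines.sparkSession.implicits._
  lines
    .filter(length(trim($"value")) > 0 && !$"value".startsWith(comment))
    .first()
}
```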
Any thoughts on modifying
Likely out of scope for this pull request, but if there is a push to migrate from
Test build #69418 has finished for PR 15813 at commit
@NathanHowell, I've gone ahead and removed the JSON changes from this PR; now it only touches CSV and thus should not conflict with your work. @liancheng, want to give this a final review? I've addressed your earlier comments.
Merging in master.
  } else {
    val charset = options.charset
-   sparkSession.sparkContext
-     .hadoopFile[LongWritable, Text, TextInputFormat](location)
+   val rdd = sparkSession.sparkContext
@JoshRosen do you know why the special handling for non-UTF-8 encodings is needed? I would think TextFileFormat itself already supports that, since it is reading it in from Hadoop Text.
I'm not sure; I think this was a carryover from spark-csv.
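For context, this is roughly what that carried-over path does; a sketch under the assumption that the goal is charset-aware decoding (`readWithCharset` is an illustrative name). Hadoop's `Text#toString` always decodes as UTF-8, so a different charset has to be applied to the raw bytes manually:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Read raw line bytes through the legacy Hadoop API and decode each line
// with the user-specified charset instead of Text's implicit UTF-8.
def readWithCharset(spark: SparkSession, location: String, charset: String): RDD[String] =
  spark.sparkContext
    .hadoopFile[LongWritable, Text, TextInputFormat](location)
    .map { case (_, text) => new String(text.getBytes, 0, text.getLength, charset) }
```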
cc @falaki, can you chime in?
## What changes were proposed in this pull request?

This patch significantly improves the IO / file listing performance of schema inference in Spark's built-in CSV data source.

Previously, this data source used the legacy `SparkContext.hadoopFile` and `SparkContext.hadoopRDD` methods to read files during its schema inference step, causing huge file-listing bottlenecks on the driver. This patch refactors this logic to use Spark SQL's `text` data source to read files during this step. The text data source still performs some unnecessary file listing (since in theory we already have resolved the table prior to schema inference and therefore should be able to scan without performing _any_ extra listing), but that listing is much faster and takes place in parallel. In one production workload operating over tens of thousands of files, this change managed to reduce schema inference time from 7 minutes to 2 minutes.

A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.

## How was this patch tested?

Existing unit tests, plus manual benchmarking on a production workload.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#15813 from JoshRosen/use-text-data-source-in-csv-and-json.
…sonDataSource

## What changes were proposed in this pull request?

This PR proposes to use the text datasource for JSON schema inference. This basically proposes the similar approach in apache#15813: if we use a Dataset for the initial loading when inferring the schema, there are advantages. Please refer to SPARK-18362.

It seems the JSON one was supposed to be fixed together but was taken out, according to apache#15813:

> A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.

Also, this seems to affect some functionality because it does not use `FileScanRDD`. This problem is described in SPARK-19885 (but it was CSV's case).

## How was this patch tested?

Existing tests should cover this, plus a manual test with `spark.read.json(path)` and checking the UI.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#17255 from HyukjinKwon/json-filescanrdd.
…a different encoding

### What changes were proposed in this pull request?

This PR proposes to use the text datasource in CSV's schema inference. This shares the same reasons as SPARK-18362, SPARK-19885 and SPARK-19918 - we're currently using a Hadoop RDD when the encoding is different, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at #15813 (comment).

We should better keep the code paths consistent with the existing CSV and JSON datasources as well, but this CSV schema inference with an encoding specified is a case distinct from UTF-8 alone.

There can be another story: this PR might lead to a bug fix. Spark session configurations, say Hadoop configurations, are not respected during CSV schema inference when the encoding is different (they have to be set on the Spark context instead).

### Why are the changes needed?

For consistency, potentially better performance, and fixing a potentially very corner-case bug.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Existing tests should cover.

Closes #29063 from HyukjinKwon/SPARK-32270.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…aframe read / write API

### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces an RDD with a Dataset to write SparkR metadata; additionally, this PR removes `repartition(1)`. We actually don't need this when the input is a single row, as it creates only a single partition: https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57

### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, now we leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47341 from HyukjinKwon/SPARK-48883-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

This patch significantly improves the IO / file listing performance of schema inference in Spark's built-in CSV data source.

Previously, this data source used the legacy `SparkContext.hadoopFile` and `SparkContext.hadoopRDD` methods to read files during its schema inference step, causing huge file-listing bottlenecks on the driver.

This patch refactors this logic to use Spark SQL's `text` data source to read files during this step. The text data source still performs some unnecessary file listing (since in theory we already have resolved the table prior to schema inference and therefore should be able to scan without performing any extra listing), but that listing is much faster and takes place in parallel. In one production workload operating over tens of thousands of files, this change managed to reduce schema inference time from 7 minutes to 2 minutes.

A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.

How was this patch tested?

Existing unit tests, plus manual benchmarking on a production workload.
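For illustration, the code path this patch speeds up runs whenever CSV schema inference is requested; a minimal usage sketch (the input path is hypothetical):

```scala
// Schema inference forces an extra pass over the input files; after this
// patch that pass goes through the `text` data source instead of the
// legacy SparkContext.hadoopFile / hadoopRDD methods.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/many/csv/files")  // hypothetical location
df.printSchema()
```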